Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)


Note that the file name of the reference genome or its URL may change in the future. The above commands create the directory “refgenome”, download the human reference genome into it, and decompress it. Once the reference genome has been downloaded, you can index it with “samtools faidx” as follows:

samtools faidx GRCh38.p13_ref.fna

To read more about this command, run “samtools faidx -h”.

For a reference sequence to be indexed by the “samtools faidx” command, it must be in a well-formatted FASTA file: each sequence in the file must have a unique name or ID in its FASTA defline, and all sequence lines within a sequence must have the same length, except possibly the last line. Indexing a reference genome with Samtools enables efficient access to arbitrary regions within the FASTA file of the reference sequence.
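The two formatting requirements above can be checked before indexing. The following Python function is a minimal sketch, not part of Samtools: the name `check_fasta` is hypothetical, and it tests only for duplicate sequence names and uneven line lengths, so a file that passes may still be rejected for other reasons.

```python
def check_fasta(lines):
    """Return a list of problems likely to prevent faidx-style indexing.

    `lines` is any iterable of text lines, e.g. an open FASTA file.
    Only two checks are made: unique names and uniform line length.
    """
    problems = []
    seen = set()
    name = None
    width = None        # length of the first sequence line of the current record
    short_seen = False  # True once a line shorter than `width` has been seen
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # The sequence name is the first word of the defline.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            if name in seen:
                problems.append(f"duplicate sequence name: {name}")
            seen.add(name)
            width, short_seen = None, False
        elif name is not None and line:
            if width is None:
                width = len(line)
            elif short_seen or len(line) > width:
                # Only the last line of a record may be shorter than the rest.
                problems.append(f"uneven line length in: {name}")
            if len(line) < width:
                short_seen = True
    return problems
```

To check a file, the function could be called as `check_fasta(open("GRCh38.p13_ref.fna"))`; an empty list means that neither problem was found.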

The above “samtools faidx” command creates the index file “GRCh38.p13_ref.fna.fai”, which has the same name as the reference genome file with “.fai” appended. For a FASTA file, the fai index is a text file in which each line describes one reference sequence in five TAB-delimited columns: NAME (name of the reference sequence), LENGTH (total length of the sequence in bases), OFFSET (byte offset of the sequence’s first base within the FASTA file), LINEBASES (the number of bases on each line), and LINEWIDTH (the number of bytes in each line, including the newline), as shown in Figure 2.4.
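The five columns are enough to locate any base in the FASTA file without reading the file from the beginning, which is why the index allows fast random access. The Python sketch below illustrates the arithmetic on a tiny in-memory FASTA. The names `FaiRecord`, `parse_fai_line`, and `byte_offset` are hypothetical helpers written for this illustration and are not part of Samtools.

```python
from typing import NamedTuple


class FaiRecord(NamedTuple):
    name: str       # NAME
    length: int     # LENGTH, in bases
    offset: int     # OFFSET, byte position of the first base
    linebases: int  # LINEBASES, bases per line
    linewidth: int  # LINEWIDTH, bytes per line including the newline


def parse_fai_line(line):
    """Parse one TAB-delimited line of a .fai file."""
    name, length, offset, linebases, linewidth = line.rstrip("\n").split("\t")[:5]
    return FaiRecord(name, int(length), int(offset), int(linebases), int(linewidth))


def byte_offset(rec, pos):
    """Return the file offset of the 1-based position `pos` in sequence `rec`."""
    if not 1 <= pos <= rec.length:
        raise ValueError("position outside sequence")
    # Number of complete lines before the position, and the column within its line.
    full_lines, col = divmod(pos - 1, rec.linebases)
    return rec.offset + full_lines * rec.linewidth + col


# Toy example: one 10-base sequence stored with 4 bases per line.
fasta = b">chr1\nACGT\nACGT\nAC\n"
rec = parse_fai_line("chr1\t10\t6\t4\t5\n")
print(chr(fasta[byte_offset(rec, 6)]))  # prints C, the 6th base of ACGTACGTAC
```

With a real file, the same offset could be passed to `seek()` on the FASTA file opened in binary mode, so only the requested region needs to be read.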

Remember to index the reference genome with “samtools faidx” as above before you use it in an alignment workflow, and keep the FASTA file and its index file in the same directory (aligners typically also require their own index of the reference). In some reference genomes, the sequences are named by chromosome (e.g., chr1) instead of by accession number.

In the following, we will discuss the commonly used algorithms for read alignment and the popular aligners.

FIGURE 2.4 Part of the fai index file of the human reference genome.